CACHE FLUSHING
Background 

This invention relates to cache flushing. 

A cache is flushed in order to make sure that the contents 
of the cache and a main memory are the same. Typically, a cache 
flush is initiated by a processor issuing a flush command. A 
cache controller will then write back data from the cache into 
the main memory. 

Summary of the Invention 
Portions of a cache are flushed in stages. An exemplary 
flushing of the present invention comprises flushing a first 
portion, performing operations other than a flush, and then 
flushing a second portion of the cache. The first portion may be 
disabled after it is flushed. The cache may be functionally 
divided into portions prior to a flush, or the portions may be 
determined in part by an abort signal. The operations may access 
either the cache or the memory. The operations may involve 
direct memory access or interrupt servicing. 

Brief Description of the Drawings
FIG. 1 is a block diagram of a processor system 
incorporating the invention. 

FIG. 2 is a flow chart illustrating a flush of the cache of
FIG. 1.

FIG. 3 is a flow chart illustrating a flush of the cache of 
FIG. 1 combined with a processor transition. 

FIG. 4 is a flow chart illustrating another embodiment of
the cache flush.

FIG. 5 illustrates various L1 and L2 flush combinations.
FIG. 6 is a flowchart for an abort/retry flush with
processor transition.

FIG. 7 is a flowchart of microcode for an abortable cache.
FIG. 8 is a flowchart of microcode for a flush loop. 

Detailed Description 
In Figure 1, processor system 10 may be the core of a 
personal computer. Processor 20 uses caches 30 and 40 in 
conjunction with memory 50 to speed up data access. Cache 30, 
the L1 cache, is typically smaller and faster than cache 40, the
L2 cache. Both caches provide faster access than memory 50.
Cache 30 is, for example, 32 Kbytes and operates at the processor
speed. Cache 40 is, for example, 512 Kbytes and typically
operates at about 50% of the processor speed. For example, a
processor 20 operating at 450 MHz may have an L2 cache operating at
225 MHz. On such a system, memory 50 may be operating on bus 100
with a bus speed of only 100 MHz. Processor 20 can access memory
50 directly over bus 100. Memory controller 70 arbitrates access 
to memory 50. One function of memory controller 70 is to grant 
peripherals 90 direct memory access (DMA) to memory 50 via bus 
100. Peripherals 90 may also be granted access to memory 50 by 
signaling interrupt handler 80. Interrupt handler 80 would then 
cause processor 20 to service the requested interrupt. 

As one example of a peripheral, a digital video camera may 
be connected to bus 100 via a Universal Serial Bus (USB) 
interface (not shown). A camera or other peripheral may require
periodic access to bus 100 in order to store information, such as 
video data, in memory 50. Certain peripherals may require access 
to memory 50 about every 200 microseconds or even more often. A 
long cache flush is broken up into stages so that the peripheral 
will be able to transfer data smoothly to or from the memory. 
For example, the processor may be interrupted in order to create 
or rearrange data buffers during a memory-peripheral transfer. 
The cache flush is broken up into stages to allow for timely
servicing of the interrupt.

Caches 30 and 40 are controlled by cache controller 60. One
function of cache controller 60 is to perform cache write backs. 
Upon receipt of a flush command from processor 20, cache 
controller 60 will write back data from the flushed cache to 
memory 50. The entire cache may then be written back. 
Alternatively, a subset of the cache may be written back that 
includes those areas of the cache where the memory and cache are 
not identical. Cache controller 60 and the cache 30 are on the 
same chip as processor 20. Cache 40 may also be on the chip with 
processor 20. 

Cache 40 is shown, for example, as having three functional 
portions 42, 44, and 46. Portion 46 is smaller than portions 42 
and 44. Each portion may be flushed and disabled independently 
of the status of the other two portions. Cache 40 may therefore 
be flushed in stages. For example, first portion 42 may be 
flushed and then disabled. Then portion 44 may be flushed and 
then disabled. Finally, portion 46 may be flushed. In between
successive flushes, the processor system performs
memory-intensive operations. For example, peripherals 90 may be allowed
to store data in memory 50, or processor 20 may service 
interrupts responsive to messages from interrupt handler 80. 

As shown in FIG. 2, in a staged L2 cache flush, the cache is 
functionally identified (210) as having two or more portions. A 
portion is selected and flushed (220). The flushed portion is 
then disabled (230). The remaining portions of the cache remain
enabled. During the cache flush (220), processor 20 cannot
service interrupts. Once the flushed portion has been disabled,
the processor can service interrupts. These interrupt service
routines may access the cache or memory locations that are
mirrored in the cache (240).

After the interrupts are serviced, an unflushed portion of
the cache is flushed (250). If the cache is divided into more
than two portions, this portion of the cache is disabled and
interrupts may be serviced again. Once the entire cache has been
flushed, the entire cache is reenabled. Flushing the L2 cache in
stages allows for faster response time to interrupts, DMA
requests, or other memory requests requiring servicing.
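
The staged flush of FIG. 2 can be sketched in software terms as follows. This is only an illustrative model, not the implementation described here: the class names, the per-portion dirty line counts, and the callback used to stand in for interrupt servicing are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Portion:
    """One functional portion of the cache (hypothetical model)."""
    name: str
    dirty_lines: int          # lines that differ from main memory (made-up value)
    enabled: bool = True


@dataclass
class StagedCache:
    portions: list
    log: list = field(default_factory=list)

    def flush_portion(self, portion):
        # Write the portion back (220/250), then disable it (230) so that
        # it cannot be dirtied again before the staged flush completes.
        self.log.append(f"flush {portion.name} ({portion.dirty_lines} lines)")
        portion.dirty_lines = 0
        portion.enabled = False

    def staged_flush(self, service_pending):
        last = len(self.portions) - 1
        for index, portion in enumerate(self.portions):
            self.flush_portion(portion)
            if index < last:
                # Window between stages: interrupts or DMA may be serviced (240).
                service_pending(self)
        # Once every portion has been flushed, the whole cache is reenabled.
        for portion in self.portions:
            portion.enabled = True


def service_interrupts(cache):
    """Stand-in for interrupt servicing between stages (hypothetical)."""
    cache.log.append("service interrupts")


if __name__ == "__main__":
    cache = StagedCache([Portion("42", 800), Portion("44", 400), Portion("46", 100)])
    cache.staged_flush(service_interrupts)
    print("\n".join(cache.log))
```

In this model the longest period during which interrupts are held off is the flush of the largest single portion rather than the flush of the whole cache.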

The size (the amount of data storage capacity) of the 
portions of the cache can vary, and may be configurable. The 
ratio of the sizes of the various portions may be a power of two.
For example, the first portion may be four or eight times the size of
the second portion, or, in another example, a three-portion cache
may have portion ratios of 8:4:1.
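
As a rough illustration of what such ratios imply, the sketch below computes each portion's share of the cache and, under the simplifying assumption that flush time is proportional to portion size, the time spent in each stage. The total flush time used in the example is a made-up figure chosen only to keep the arithmetic readable.

```python
from fractions import Fraction


def portion_shares(ratios):
    """Return each portion's share of the whole cache as an exact fraction."""
    total = sum(ratios)
    return [Fraction(r, total) for r in ratios]


def stage_times(total_flush_time, ratios):
    """Split a total flush time across portions, assuming time scales with size."""
    return [total_flush_time * share for share in portion_shares(ratios)]


if __name__ == "__main__":
    # 8:4:1 split; the 1300 time units are a hypothetical total flush time.
    for share, time in zip(portion_shares((8, 4, 1)), stage_times(1300, (8, 4, 1))):
        print(f"share {share}, stage time {time}")
```

With an 8:4:1 split, the final stage covers one thirteenth of the cache, which is why the last flush before a transition can be kept short.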

FIG. 3 illustrates a staged L2 cache flush in conjunction
with a processor transition. For example, processor 20 may be
operating (310) in the C1 processor state within the G0 operating
state as defined in the Advanced Configuration and Power
Interface Specification ("ACPI"), Revision 1.0b, released
February 2, 1999. Portions of the ACPI are reprinted in the
Appendix. One or more computer programs, for example, operating
system software, application software, microcode, or otherwise,
are stored on a computer readable medium and are executed on
processor 20. Processor 20 requests, in response to user
input, system status information, or otherwise, to enter the C3
state, also within the G0 operating state. Processor 20
selects (320) a different processor state within the G0 state, in
this example the C3 state. Processor 20 then signals cache
controller 60 to flush
(330) a first portion of the L2 cache. Once the flush is
complete, the first portion is disabled (340), and the interrupts
are serviced (350). The second and last portion of the cache is
then flushed (360). The last portion to be flushed is usually
smaller in size than the first portion. Accordingly, the flush
of this last portion is faster than the flush of the first
portion. The second portion is disabled (370). Processor 20
then transitions (380) from the C1 state to the C3 state.
Interrupts cannot be serviced during a cache flush or during the
transition. In some cases, peripherals may not access memory
during those times. Likewise, access must be denied between the
final flush and the transition. By making the final flush brief,
the flush can be immediately followed by a transition without
significantly impacting the total duration between memory or
cache accesses. Following the transition, interrupts and caching
are enabled (390).

In another embodiment shown in FIG. 4, the behavior of a 
cache flush instruction is modified. The cache flush instruction 
WBINVD (Write Back and Invalidate) ordinarily flushes the entire 
cache. After interrupts and caching are disabled (405), the
behavior of the system in response to a WBINVD instruction is
modified (410) to flush half of the L2 cache. The WBINVD is
executed (415) to flush the L1 cache and one half of the L2
cache. The behavior of the system is then reconfigured (420) so
that execution of the WBINVD instruction will flush the entire L2
cache. Once the flushed half of the L2 cache is disabled (425),
interrupts and caching are enabled (430). Interrupts are then
serviced (435). Interrupts and caching are again disabled (440)
and the L1 cache and remainder of the L2 cache are flushed (445).
Caching and interrupts are enabled (455) and interrupts are
serviced again (460).

The final flush and transition now take place. Caches and
interrupts are disabled (465), the L1 cache is flushed (470) with
the WBINVD instruction, and the processor transitions (475) from one
G0 state to another. Once the transition is complete, the entire
L2 cache is enabled (480). Interrupts and caching are also
enabled (485).

The order of cache flushing is variable. FIG. 5 illustrates
some possibilities. First the L2 cache is flushed (510) and then
disabled (520). Interrupts are serviced (530) or other
operations performed before the next flush. Then the L1 cache is
flushed (540). Also, a series of caches within a cache hierarchy
may be flushed in stages. In such a circumstance, a portion of 
the cache hierarchy is flushed by flushing one or more caches 
within a cache hierarchy. Two examples of a staged flush of a 
cache hierarchy are shown in FIG. 5. 

The L2 cache can be flushed in stages, followed by a flush 
of the L1 cache. Again, interrupts can be serviced or other
operations performed in between the three flushes. In either
case, the L1 cache flush can be followed by a transition from one
processor state to another. 

Another embodiment of the invention has an abortable L2 
cache so that a high latency L2 cache flush may be aborted, and 
then resumed. The individual flushable portions of the cache are 
not predetermined. Rather, the sizes of the portions vary
depending on the time between flush abort signals. Upon receipt
of an abort signal, the flush may be aborted immediately, be
aborted after a specified or predetermined time period, or be
aborted after the flushing of a portion of the cache of a
predetermined size. An indicator is stored indicating how much
of the cache was flushed prior to the abort. The indicator is,
for example, stored in a register named Flush MSR (Flush Model
Specific Register). The indicator stores the last flushed
segment, half-segment, word, or some other value. Software can
use this register to communicate with the WBINVD microcode flow
to tell it whether to use standard semantics or abort/retry
semantics. Software retries the WBINVD until the L2 cache is
completely flushed, providing windows between each iteration for
the processor to service interrupts. With reference to FIG. 1,
the Flush MSR stores the boundary between portions 42 and 44 after
portion 42 is flushed. After the first retry, the Flush MSR
stores the boundary between portions 44 and 46.

The processor is allowed to handle interrupts in between the 
flush abort and flush retry. When the flush is retried, the L1
cache is flushed, where needed, then the L2 abortable cache
starts flushing where the prior flush left off. In a particular
embodiment, the L2 cache flush is aborted only after 250 usec have
elapsed from the start of the flush. The cache flush abort/retry control may
be implemented in either microcode or hardware state machines. 

The cache flush abort/retry scheme is a generic 
architectural feature and could be used by a device driver or 
operating system where there is a need for reduced cache flush 
latency. The availability of this feature can be indicated by a 
feature flag in, for example, the processor. For microcode 
implementations, a microcode-visible control register is used to
accomplish a retriable flush. The microcode also uses the 
following hardware features: a) a saturating down-counter to know 
when the initial no-abort time has passed; b) a bit that the 
microcode can poll to determine if an interrupt is pending; c) a 
bit that blocks the interrupt pending bit; and d) a bit that the 
microcode can set that controls whether the L2 allocates new 
lines or not. 

In an embodiment, the counter uses the bus clock as its time 
reference. The periods for the bus clock frequencies are 15
nsec, 10 nsec, 7.5 nsec, and 5 nsec for 66 MHz, 100 MHz, 133 MHz,
and 200 MHz respectively. The counter counts 2.5 nsec time
intervals and decrements by 6, 4, 3, or 2 ticks per bus clock at
66 MHz, 100 MHz, 133 MHz, and 200 MHz respectively. The interrupt
pending bit is set if there is an external interrupt, the counter 
has reached zero, and the blocking bit is not set. 
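
The decrement values follow directly from dividing each bus clock period by the 2.5 nsec tick. The sketch below reproduces that arithmetic and models the saturating down-counter and the interrupt pending condition. The class and function names are hypothetical, and treating the 250 usec figure mentioned earlier as the counter's starting value is an assumption made only for the example.

```python
TICK_NS = 2.5  # the counter counts 2.5 nsec intervals

# Bus clock periods in nsec, as listed in the text.
BUS_PERIOD_NS = {66: 15.0, 100: 10.0, 133: 7.5, 200: 5.0}


def ticks_per_bus_clock(bus_mhz):
    """Number of 2.5 nsec ticks that elapse in one bus clock."""
    return int(BUS_PERIOD_NS[bus_mhz] / TICK_NS)


def no_abort_ticks(microseconds):
    """Convert a no-abort time in usec to 2.5 nsec ticks."""
    return int(microseconds * 1000 / TICK_NS)


class SaturatingDownCounter:
    """Hypothetical model of the no-abort timer: it stops at zero."""

    def __init__(self, start_ticks, bus_mhz):
        self.value = start_ticks
        self.step = ticks_per_bus_clock(bus_mhz)

    def clock(self):
        # One bus clock elapses; saturate at zero instead of wrapping.
        self.value = max(0, self.value - self.step)
        return self.value


def interrupt_pending(external_interrupt, counter_value, blocking_bit):
    """Pending bit: external interrupt, counter expired, and not blocked."""
    return bool(external_interrupt and counter_value == 0 and not blocking_bit)


if __name__ == "__main__":
    for mhz in (66, 100, 133, 200):
        print(mhz, "MHz:", ticks_per_bus_clock(mhz), "ticks per bus clock")
    counter = SaturatingDownCounter(no_abort_ticks(250), bus_mhz=100)
    clocks = 0
    while counter.value > 0:
        counter.clock()
        clocks += 1
    print("bus clocks until aborts are allowed:", clocks)
```

Because the counter always counts in the same 2.5 nsec units, the same starting value gives the same no-abort time regardless of the bus frequency.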

In FIG. 6, an abortable L2 cache, flushed prior to a
processor state transition, is controlled by microcode.
Interrupts and caching are disabled (610). The Flush MSR is
initialized (e.g. set to 0x80000000) (615). The WBINVD
instruction is executed (620). If the L2 cache is completely
flushed (Flush MSR equals zero), then a processor transition is
performed (630). Interrupts and caching are enabled (635).

If the Flush MSR was not equal to zero, then the Flush MSR 
value is saved and Flush MSR is cleared (645). Interrupts and 
caching are enabled (650) and interrupts are serviced (655). 
Interrupts and caching are then disabled (660), the Flush MSR is 
restored (665), and the flush is retried by executing WBINVD again
(620).
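
The retry loop of FIG. 6 can be modeled as shown below. This is a toy simulation rather than driver code: the FakeProcessor class, its wbinvd method, and the fixed number of sets flushed per window are all hypothetical. The encoding of the Flush MSR (the top bit selecting abort/retry semantics and the low bits holding the progress count) is likewise an assumption; the text gives only the initial value 0x80000000 and states that zero means the flush is complete.

```python
RETRY_ENABLE = 0x80000000  # initial Flush MSR value given in the text


class FakeProcessor:
    """Toy stand-in for the processor; every member here is hypothetical."""

    def __init__(self, total_sets, sets_per_window):
        self.total_sets = total_sets
        self.sets_per_window = sets_per_window  # sets flushed before an abort
        self.flush_msr = 0
        self.interrupts_enabled = True
        self.trace = []

    def wbinvd(self):
        if not self.flush_msr & RETRY_ENABLE:
            # Standard semantics: flush everything in one go.
            self.trace.append("full flush")
            self.flush_msr = 0
            return
        progress = self.flush_msr & ~RETRY_ENABLE
        end = min(progress + self.sets_per_window, self.total_sets)
        self.trace.append(f"flush sets {progress}..{end - 1}")
        # Zero signals completion; otherwise record where the flush stopped.
        self.flush_msr = 0 if end == self.total_sets else RETRY_ENABLE | end


def flush_then_transition(cpu, service_interrupts):
    cpu.interrupts_enabled = False           # 610
    cpu.flush_msr = RETRY_ENABLE             # 615
    while True:
        cpu.wbinvd()                         # 620
        if cpu.flush_msr == 0:               # flush complete
            break
        saved = cpu.flush_msr                # 645: save, then clear
        cpu.flush_msr = 0
        cpu.interrupts_enabled = True        # 650
        service_interrupts(cpu)              # 655
        cpu.interrupts_enabled = False       # 660
        cpu.flush_msr = saved                # 665: restore before retrying
    cpu.trace.append("transition")           # 630
    cpu.interrupts_enabled = True            # 635


if __name__ == "__main__":
    cpu = FakeProcessor(total_sets=10, sets_per_window=4)
    flush_then_transition(cpu, lambda c: c.trace.append("service interrupts"))
    print("\n".join(cpu.trace))
```

Clearing the Flush MSR while interrupts are serviced means that, in this model, any WBINVD executed by an interrupt handler uses standard semantics and does not disturb the saved progress.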

Execution of a WBINVD instruction with an abortable cache 
flush triggers execution of the microcode flow shown in FIG. 7. 
First the processor is prepared (706) for a flush operation. The 
L2 cache is then set to "No Allocate" mode (712). In "No 
Allocate" mode, the processor will be able to access and use the 
unflushed portions of the cache, but new data from memory will 
not be written into the cache, even if the processor must fetch 
data from memory. 

The "no-abort" timer is initialized to the "no-abort" time
period (714). Until this timer expires, abort requests will be
postponed. The L1 instruction caches are invalidated (716). A
single segment of code is used for both the L1 and abortable L2
cache. The code must be set up for the proper cache. The
progress count is initialized to zero so that the L1 cache flush
starts from the beginning of the cache, and flush aborts are
blocked (718). The L1 data cache is then flushed (720). If the
L2 cache is not being flushed (722), the processor is prepared
for regular instruction execution (724). The Flush MSR is set to
zero and the L2 cache is set to allocate mode (728). In this
configuration, the next flush will be a regular flush. A message
is generated indicating that the flush has been completed (730).

If the L2 cache is being flushed, the size of the L2 cache
is determined and flush aborts are unblocked (734). The
following sequence flushes an eight-way L2 cache in two halves of
four ways each. First, if ways 0 to 3 are enabled and not flushed
(736), then ways 0 to 3 are flushed (740). The Flush MSR is set
(742) to indicate that the first half of the cache is flushed. If
ways 4 to 7 are enabled (744), then the progress count is
initialized from the Flush MSR register. If the flush had
previously been aborted during the flush of the second half of
the cache, then the flow would have come from block 736 directly
to block 744 and to block 746. In this case, the Flush MSR would
hold the location where the last flush was aborted. Ways 4 to 7
are flushed (748). Alternatively, all ways of an N-way cache can
be flushed in one functional block, e.g. block 740. The flow
then branches to block 724.

The flush loop shown in FIG. 8 is executed by the Flush
blocks (e.g., blocks 720, 740, and 748) from FIG. 7. The loop
starts at block 802. The progress count is subtracted from the
set count (804). This value is placed in the set counter. The
progress count is indicative of how much of the cache has been
flushed. It is derived from the Flush MSR. The set count is the
total number of sets in the cache. The set pointer is then set
to point to the next set to be flushed (806). The ways of the
current half set are flushed (808-814). If all sets are flushed,
then control returns to the flush flow in FIG. 7.

If sets remain to be flushed (816), then the set counter is
decremented (820). The set pointer is incremented to the next
set (822). The progress count is also incremented (824). This
keeps track of the progress of the flush in case of an abort. If
there is no pending interrupt, then the flush continues at block
808. If there is a pending interrupt, the processor is prepared
for normal execution (828). The progress count is saved in the
Flush MSR (832) and the flow exits from the WBINVD instruction
(834). Interrupts may now be serviced.
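
A simplified reading of the flush loop of FIG. 8 is sketched below. It works on whole sets rather than half sets, and the function name, the flush_set callback, and the interrupt_pending callback are hypothetical stand-ins for the microcode and hardware bits described above. The returned progress count plays the role of the value saved in the Flush MSR.

```python
def flush_loop(set_count, progress_count, flush_set, interrupt_pending):
    """Flush sets starting at progress_count; stop early if an interrupt is pending.

    Returns the updated progress count, which equals set_count when the
    flush has completed.
    """
    set_counter = set_count - progress_count      # 804: sets still to flush
    set_pointer = progress_count                  # 806: next set to flush
    while set_counter > 0:
        flush_set(set_pointer)                    # 808-814: flush the ways
        set_counter -= 1                          # 820
        set_pointer += 1                          # 822
        progress_count += 1                       # 824
        if set_counter > 0 and interrupt_pending():
            break                                 # 828-834: abort, keep progress
    return progress_count


if __name__ == "__main__":
    flushed = []
    pending = iter([False, False, True])
    progress = flush_loop(8, 0, flushed.append, lambda: next(pending))
    print("aborted after", progress, "sets:", flushed)
    progress = flush_loop(8, progress, flushed.append, lambda: False)
    print("completed at", progress, "sets:", flushed)
```

Passing the saved progress count back in on the retry is what lets the flush resume where it left off instead of starting again from the first set.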

Other embodiments are within the scope of the following
claims. 
APPENDIX 

2.2 Global System State Definitions 
. . * 

G0 - Working:

A computer state where the system dispatches user mode 
(application) threads and they execute. In this state, devices 
(peripherals) are dynamically having their power state changed. 
The user will be able to select (through some user interface) 
various performance/power characteristics of the system to have 
the software optimize for performance or battery life. The system 
responds to external events in real time. It is not safe to 
disassemble the machine in this state. 

4.7.2.6 Processor Power State Control 

ACPI supports placing system processors into one of four
power states in the G0 working state. In the C0 state the
designated processor is executing code; in the C1-C3 states it is
not. While in the C0 state, ACPI allows the performance of the
processor to be altered through a defined "throttling" process
(the C0 Throttling state in the diagram below). Throttling
hardware lets the processor execute at a designated performance 
level relative to its maximum performance. The hardware to enter 
throttling is also described in this section. 

In a working system (global G0 working state) the OS will
dynamically transition idle CPUs into the appropriate power
state. ACPI defines logic on a per-CPU basis that the OS uses to
transition between the different processor power states. This
logic is optional, and is described through the FACP table and
processor objects (contained in the hierarchical name space).
The fields and flags within the FACP table describe the
symmetrical features of the hardware, and the processor object
contains the location for the particular CPU's clock logic
(described by the P_BLK register block). The ACPI specification
defines four CPU power states for the G0 working state: C0, C1,
C2 and C3.

In the C0 power state, the processor executes.

In the C1 power state, the processor is in a low power state
where it is able to maintain the context of the system caches.
This state is supported through a native instruction of the
processor (HLT for IA-PC processors), and assumes no hardware
support is needed from the chipset.

In the C2 power state, the processor is in a low power state
where it is able to maintain the context of system caches. This
state is supported through chipset hardware described in this
section. The C2 power state is lower power and has a higher exit
latency than the C1 power state.

In the C3 power state, the processor is in a low power state
where it is not necessarily able to maintain coherency of the
processor caches with respect to other system activity (for
example, snooping is not enabled at the CPU complex). This state
is supported through chipset hardware described in this section.
The C3 power state is lower power and has a higher exit latency
than the C2 power state.

The P_BLK registers provide optional support for placing the
system processors into the C2 or C3 states. The P_LVL2 register
is used to sequence the selected processor into the C2 state, and
the P_LVL3 register is used to sequence the selected processor
into the C3 state. Additional support for the C3 state is
provided through the bus master status and arbiter disable bits
(BM_STS in the PM1_STS register and ARB_DIS in the PM2_CNT
register). System software reads the P_LVL2 or P_LVL3 registers
to enter the C2 or C3 power state. Hardware is required to put
the processor into the proper clock state precisely on the read
operation to the appropriate P_LVLx register.

Processor power state support is symmetric; all processors
in a system are assumed by system software to support the same
clock states. If processors have non-symmetric power state
support, then the BIOS will choose and use the lowest common
power states supported by all the processors in the system
through the FACP table. For example, if the P0 processor supports
all power states up to and including the C3 state, but the P1
processor only supports the C1 power state, then the ACPI driver
will only place idle processors into the C1 power state (P0 will
never be put into the C2 or C3 power states). Note that the C1
power state must be supported; C2 and C3 are optional (see the
PROC_C1 flag in the FACP table description in section 5.2.5).

4.7.2.6.1 C2 Power State 

The C2 state puts the processor into a low power state 
optimized around multiprocessor (MP) and bus master systems. The 
system software will automatically cause an idle processor 
complex to enter a C2 state if there are bus masters or MP 
processors active (which will prevent the OS from placing the 
processor complex into the C3 state). The processor complex is
able to snoop bus master or MP CPU accesses to memory while in 
the C2 state. Once the processor complex has been placed into the 
C2 power state, any interrupt (IRQ or reset) will bring the 
processor complex out of the C2 power state. 

4.7.2.6.2 C3 Power State

The C3 state puts the designated processor and system into a
power state where the processor's cache context is maintained, 
but it is not required to snoop bus master or MP CPU accesses to 
memory. There are two mechanisms for supporting the C3 power 
state: 

Having the OS flush and invalidate the caches prior to 
entering the C3 state. 

Providing hardware mechanisms to prevent masters from 
writing to memory (UP only support). 

In the first case the OS will flush the system caches prior to 
entering the C3 state. As there is normally much latency 
associated with flushing processor caches, the ACPI driver is 
likely to only support this in MP platforms for idle processors. 
Flushing of the cache is through one of the defined ACPI 
mechanisms (described below, flushing caches).

In UP only platforms that provide the needed hardware 
functionality (defined in this section), the ACPI driver will 
attempt to place the platform into a mode that will prevent 
system bus masters from writing into memory while any processor 
is in the C3 state. This is done by disabling bus masters prior 
to entering a C3 power state. Upon a bus master requesting an 
access, the CPU will awaken from the C3 state and re-enable bus 
master accesses. 

The ACPI driver uses the BM_STS bit to determine which Cx 
power state to enter. The BM_STS is an optional bit that 
indicates when bus masters are active. The ACPI driver uses this 
bit to determine the policy between the C2 and C3 power states: 
lots of bus master activity demotes the CPU power state to the C2
(or C1 if C2 is not supported), no bus master activity promotes
the CPU power state to the C3 power state. The ACPI driver keeps
a running history of the BM_STS bit to determine CPU power state 
policy. 



The last hardware feature used in the C3 power state is the
BM_RLD bit. This bit determines if the Cx power state is exited
based on bus master requests. If set, then the Cx power state is
exited upon a request from a bus master; if reset, the power
state is not exited upon bus master requests. In the C3 state,
bus master requests need to transition the CPU back to the C0
state (as the system is capable of maintaining cache coherency),
but such a transition is not needed for the C2 state. The ACPI
driver can optionally set this bit when using a C3 power state,
and clear it when using a C1-C2 power state.

What is claimed is: